Abstract: Smart cities can use artificial neural networks
to provide more accurate information about public
transportation schedules, and thus help the population
plan their day-to-day activities. In this context, this paper
describes the essential steps for acquiring and
processing data and for creating a neural network
model capable of predicting possible delays or early
arrivals on bus lines in the city of Curitiba, Paraná. The
neural network considers traffic data, weather, time and the
history of a public transport line. The article details all
phases of data collection and treatment, how the
information is fed into the network, and the results
obtained.

I. INTRODUCTION

Collective public transportation is of great importance to 
Brazilian cities. These transit systems provide invaluable 
access and locomotion to some of the country's poorest 
citizens. Furthermore, collective public transportation
vehicles help to reduce traffic congestion, mitigate 
emissions from individual automobiles and contribute to 
an overall strategy to promote cleaner and 
environmentally-friendly cities. 

Urban public transport plays an important role in the
current pattern of urban mobility as a means of
transport that interconnects the
various regions of a city. It is an alternative that helps
reduce serious problems found in cities, such as
congestion, traffic accidents and environmental impacts
[1].

Forecasts of public transport delays can be a useful
tool that drivers and passengers use to plan their
daily tasks. Such predictions can be obtained by analyzing
data directly or indirectly linked to the punctuality of a
line. Data collection is an important aspect of urban
computing and a determining factor in building smart
cities [2].


This set of information could be used to train an
artificial neural network that analyzes the data and
tries to find possible connections within it, producing
a model that predicts delays or early arrivals. Such a
model could help professionals in the area of data
analysis and even the users of the bus network.

In the context of public transport bus lines, one of
the known issues is compliance with the established
schedules. Because delays are in many cases
caused by factors that cannot be controlled, it is not
always possible to prevent them. Predicting
these delays gives interested parties the
information in advance, so that they can decide how to work
around the situation [3].

As the population grows, so do the challenges for
government, business and academia [4]. The analysis of
data to create resources for smart cities has been the
subject of several studies in both academic and
business environments. Collecting and processing data in
this way can be of great value to companies and
users, who could draw on a large amount of
information to plan and improve their activities, and
also to the government, which could benefit from the
improvement in the service provided. Research shows
that the greatest causes of dissatisfaction among the
Brazilian population with public transportation are
problems with network coverage (capillarity) and frequency,
slowness and frequent delays, which, according to [5],
cause the population to use public transportation less.

According to [6], congestion concerns everyone.
Brazilian metropolitan areas live with urban congestion, a
nightmare that is difficult to measure. The feeling of
time wasted in heavy congestion is distressing,
and few people manage to live with this
reality calmly. In recent years, millions of people have
lost money and time because of congestion [7], and
the cost of car trips rises considerably during
congestion [8, 9].

Given this reality, this article proposes a solution of
public utility that informs the population of the expected
delays of a bus line. The solution can reduce the
problems described above and can benefit transport
users, service provider companies and government
entities.

This article is organized as follows. Section II discusses
related work. Section III describes the data collection
considered in this paper and examines the data model and
how each kind of data may or may not directly influence the
final result. Section IV describes the experiments performed,
how the collected data was fed to the neural
network and which tool was used, and presents the
results obtained and the network responses. Section V
concludes the paper and presents future work.

II. RELATED WORK 

Predicting possible delays of collective public
transport vehicles has already been considered in other
works. Some of these studies involve computational
intelligence, but the vast majority use only historical data
combined with some other technique. A few
studies are closer to this work
and also use weather and traffic data.

Maciel [3] evaluates the performance of regression
algorithms on historical data for forecasting the start
and end times of daily trips. For the start and end
times of a trip, the median of the errors was
approximately 28 and -167 seconds, respectively. The
work shows that the quality of the forecasts also changes
over the course of the week: the worst results were
obtained on Monday and the best on Wednesday and
Thursday. A similar pattern was
found across the hours of the day, where the start and end
times of the usual Brazilian workday produced the
most inconsistent results. The work also considered some
weather data and its influence.

Moraes Filho [10] presents a project called
CittaMobi, a set of solutions that
aims to make public transport information available to
bus users. The application provides real-time predictions
of bus arrivals, the locations of the closest
stops together with the lines that pass through them, and
some details about each bus, for example whether the bus is
adapted for passengers with special needs.

On the other hand, the work of Serafim [11] consists of an
experiment carried out in the area of public transportation,
with data collected by an observer through
direct observation of bus arrivals. In the study,
data were collected for 23 days. To evaluate the
punctuality of the bus, it was assumed that the process
generating the random sequence of delays, early arrivals and
on-time arrivals on a given day is influenced by what
happened on the previous days and can therefore be
described by a Markov chain. Besides the estimation,
some simulated samples of these chains were also used in
the work. However, it was found that samples of
up to 50 days do not carry sufficient information to
detect a dependency structure, even though modeling the
variable through a Markov chain was shown to be
practical.

The idea of improving the information offered to public
transport users, based on information provided by other
individuals, was explored by Lucio [12]. That work uses
collective intelligence, described as a
form of distributed intelligence that is constantly improved by
its users and coordinated in real time, resulting in the
creation of knowledge through collaboration. The work
shows how the resources provided by mobility, in
conjunction with collective intelligence, can be used to
create Intelligent Transport Systems (ITS). In this
scenario, the data required to create these
intelligent systems are provided by the users of public
transportation through their mobile devices, building
a large collection of information about the
transportation system from the contributions of its
users.

III. EXPERIMENTAL ANALYSIS 

To predict the travel times of a public
transport line in Curitiba, it is necessary to collect data about
this line and additional data that may influence its route.
These lines have routes that wind through the city,
are heavily used by passengers, and pass through
terminals and streets with a large flow of people and
vehicles. The additional data consists of weather
information and traffic incidents, treated generically and
without any specific category, both of which have a great
impact on the flow of vehicles. The acquisition process is
to: identify the lines and collect the real-time data and
timetables of the line and of each vehicle; identify and
collect weather data for the region of Curitiba, focusing
on temperature, humidity, wind speed and the description of
the weather; and, finally, collect traffic data based on the
line locations.

After the data is collected, the stored information is
analyzed. Before submitting it for processing by the
neural network, it is necessary to identify and translate the
information. Some of the data collected is expressed in
natural language. A more
advanced classification based on Natural Language
Processing (NLP) is not necessary, since the set of terms
used is very limited. A simple classification is needed only
for the weather and traffic data, to simplify data entry
into the neural network model.

3.1 Data Collection 

To perform the prediction, historical data are needed,
especially data that may be connected with public
transport bus delays. These data were collected in several
ways and in several formats: weather data, the history of
delays of the chosen line, and traffic incidents and traffic
flow in the region traveled by the bus at the time of
collection, among others.

3.1.1 Climatic Data Collection 

Authors report that adverse weather conditions cause
significant changes in travel decisions [13]. The relationship
between weather conditions and traffic flow is addressed
in [14], which links weather
conditions to traffic speed and
to the number of accidents. Together, these effects
change the traffic flow, as shown in Fig. 1.


[Figure: diagram relating weather to traffic speed and to the
frequency and severity of accidents.]

Fig. 1: The relationship between weather, road safety,
traffic speed and traffic flow.

Also in [14], it is shown that on rainy days the number of
bus passengers decreases and the number of cars on
the streets increases. The authors point out that the
weather influences not only the number of passengers but
also the time it takes to complete the route to the
destination and the time spent waiting for the public
transport vehicle. In conclusion, precipitation, cloudiness,
wind speed, high temperatures and hail can alter the
intensity of traffic, which underlines the need to incorporate
meteorological conditions into research directly or
indirectly linked to traffic.

Weather conditions, along with brightness and visibility,
and their link to traffic flow and the number of accidents
are examined in studies of Orange County, California [15].
That work presents tables of data on the possible influence
of weather conditions, which point to
some links between traffic flow speed and
weather conditions.


With this in mind, some data on weather conditions in the
region where the bus circulates are necessary. Weather
information was available in the form of a web service. The
web service chosen for this research was the HG
Weather - Weather Forecasting API [16], a
project designed to disseminate information for free. This
web service can be queried by city code
(WOEID), by Geo IP, by geolocation or by the name of the
city.

The mode chosen was the WOEID code. According to [17],
WOEID stands for Where On Earth
Identifier, a scheme that marks the location of cities and
identifies each with a specific code. For the city of
Curitiba the WOEID is 455822. This way of obtaining
data was chosen because it does not require an access key.
The data provided by this web service are: temperature;
date and time of the data update (a refresh occurs
approximately every 30 minutes); a code for the current
weather condition; a full description of the current
weather; a short description of the current weather;
whether it is currently 'day' or 'night'; the city name
for the code entered; air humidity; wind speed;
sunrise and sunset times; and the general forecast for the
next few days.
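
As a rough illustration of this collection step, the sketch below
requests the current conditions by WOEID and keeps only the fields
listed above. The endpoint URL and the JSON key names (`results`,
`temp`, `wind_speedy` and so on) are assumptions about the HG Weather
service, not details taken from this paper, and should be checked
against the service documentation [16].

```python
import json
from urllib.request import urlopen

# Assumed endpoint; verify against the HG Weather documentation [16].
HG_WEATHER_URL = "https://api.hgbrasil.com/weather?woeid={woeid}"
CURITIBA_WOEID = 455822  # WOEID reported in this paper

# Assumed key names for the fields listed in the text.
FIELDS = ("temp", "date", "time", "condition_code", "description",
          "currently", "city", "humidity", "wind_speedy",
          "sunrise", "sunset", "condition_slug")


def fetch_weather(woeid=CURITIBA_WOEID):
    """Download the raw JSON answer for one city (needs network access)."""
    with urlopen(HG_WEATHER_URL.format(woeid=woeid), timeout=10) as resp:
        return json.load(resp)


def extract_weather(payload):
    """Keep only the fields of interest; missing keys become None."""
    results = payload.get("results", {})
    return {name: results.get(name) for name in FIELDS}


# Offline example with a hand-written payload (values are illustrative).
sample = {"results": {"temp": 29, "humidity": 40, "description": "Sunny",
                      "condition_slug": "clear_day", "condition_code": "32"}}
record = extract_weather(sample)
print(record["temp"], record["humidity"], record["wind_speedy"])
```

Separating the download from the field extraction lets the extraction
be exercised on stored answers without contacting the service.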

3.1.2 Data Collection from the Bus Line 

Bus punctuality data on different days and at different
times should be collected for forecasting. This collection
should be periodic and run long enough for results to be
observed. The authors suggest two years of data collected
daily and every 3 minutes, an interval that ensures each
sample differs from the previous one, so that the database is
sufficient for extracting useful data. It is
suggested that one year of data could already bring
satisfactory results. This is possible because the Curitiba
City Hall provides documents and government
information through web services under an initiative called
the Open Data Portal [18]. This data is available in an open
format for unrestricted use and editing by its users; it is
therefore in the public domain and free to use, and is
intended to produce new information and digital applications
for society.

The service is in its first version and provides databases
from the various bodies of the Municipal Government of
Curitiba. These databases are available for download from
the web site or through web services with direct access. The
information available for download is updated every
month and can be accessed without signing any terms
or providing personal identification, with or without
commercial purpose. Access to the
web service is granted by URBS S/A (Company of
Urbanization of Curitiba), which delivers a
document containing the user's login and password. The
web service offers static data, which do not require frequent
updating, and dynamic data, which are updated every 2
minutes, depending on the type of service. For example, data
on the city's main points of interest are static and do not
require updating, whereas vehicle location data and
information on whether a particular line is delayed change
constantly and are therefore updated frequently.

To request the data of a certain line, it is
necessary to provide the line code, which has 3
characters and can be found in the service itself. For a
given line code, the following data are available:
the vehicle prefix, which is the specific code of each
vehicle in the network; the time of the update; latitude and
longitude as floating-point values; the line prefix, which is
the code entered when requesting data; whether the vehicle is
adapted for wheelchair users (1 for yes, 0 for no); the type
of bus; the timetable that the vehicle is following
(normal or Sundays and holidays); the status of the
vehicle against its timetable (late, early, on time); and a
counter of cycles without an update to the vehicle
information. Since the information is updated every two
minutes, this counter is increased by 1 for each two-minute
cycle without an update (up-to-date information has code 1).
The line chosen for this work was 022 - Inter 2.
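
The fields listed above can be held in a small typed record. The
following is only a sketch: the key names in `raw` (`prefix`, `lat`,
`counter` and so on) are hypothetical placeholders, since the paper
does not give the names used by the URBS web service, and the sample
values are invented.

```python
from dataclasses import dataclass


@dataclass
class VehicleRecord:
    vehicle_prefix: str
    updated_at: str
    latitude: float
    longitude: float
    line_code: str
    wheelchair_adapted: bool
    bus_type: str
    timetable: str
    situation: str              # "late", "early" or "on time"
    cycles_without_update: int  # counts two-minute cycles


def parse_vehicle(raw):
    """Convert one decoded JSON object (hypothetical keys) into a record."""
    return VehicleRecord(
        vehicle_prefix=str(raw["prefix"]),
        updated_at=str(raw["updated_at"]),
        latitude=float(raw["lat"]),
        longitude=float(raw["lon"]),
        line_code=str(raw["line"]),
        wheelchair_adapted=int(raw["adapted"]) == 1,
        bus_type=str(raw["bus_type"]),
        timetable=str(raw["timetable"]),
        situation=str(raw["situation"]),
        cycles_without_update=int(raw["counter"]),
    )


def is_up_to_date(record):
    """The text states that up-to-date information has counter 1."""
    return record.cycles_without_update == 1


# Invented sample for line 022; none of these values come from the paper.
raw = {"prefix": "AB123", "updated_at": "12:11", "lat": "-25.43",
       "lon": "-49.27", "line": "022", "adapted": "1", "bus_type": "1",
       "timetable": "normal", "situation": "late", "counter": "1"}
vehicle = parse_vehicle(raw)
print(vehicle.situation, is_up_to_date(vehicle))
```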

3.1.3 Traffic Data Collection 

Congestion in the city slows down everyone involved
and increases the time spent in traffic. For this reason,
traffic incidents and traffic flow in the region traveled
by the bus are also important and should be considered.

One way to collect this data is the Bing Transit [19] web
service, which responds with a JSON file containing
information about accidents or obstructions in a
rectangular area bounded by two latitudes and two
longitudes that represent the four sides of that area. The
area is specified in the following order: a south
latitude, a west longitude, a north latitude and an east
longitude. The information returned for the specified
region is the time and type of each accident or obstruction
(closed street, construction on the road, vehicle collision,
fallen tree). In this work, only the number of
events in the area was used.
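
A minimal sketch of the two operations described above: formatting
the bounding box in south, west, north, east order, and counting the
events in a decoded JSON answer. The `resourceSets` / `resources`
layout is an assumption about the shape of the service's answer, and
the coordinates are an arbitrary rectangle around Curitiba chosen for
illustration; neither comes from this paper.

```python
def map_area(south, west, north, east):
    """Format a bounding box in the order given in the text."""
    if south > north:
        raise ValueError("south latitude must not exceed north latitude")
    return f"{south},{west},{north},{east}"


def count_incidents(payload):
    """Count incident entries in a decoded JSON answer (assumed layout)."""
    total = 0
    for resource_set in payload.get("resourceSets", []):
        total += len(resource_set.get("resources", []))
    return total


# Illustrative rectangle and hand-written answer with two events.
area = map_area(-25.55, -49.4, -25.35, -49.18)
answer = {"resourceSets": [{"resources": [{"type": 1}, {"type": 9}]}]}
print(area, count_incidents(answer))
```

Only the count is returned because, as stated above, the type and
time of each event were not used as network inputs.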

Web Scraping, which was used in this research, is a way
of requesting data, collecting it and analyzing it to extract
the desired information by writing simple code to perform
the task [20]. According to [21], web services are the
de facto standard for data collection. However, there are
scenarios where data is not available through web
services and Web Scraping becomes necessary.



Web Scraping can be used on real-time map and traffic
sites, since many of them show the current
flow of traffic in a particular location, which can be
captured at the time of collection. An example is the Google
Maps tool [22], which reports the current travel time between
two points and whether traffic is flowing slowly or
quickly. In this work, it was decided to use only
the current travel time between the departure and arrival
points of the Inter 2 line: when this time is higher than
the average, the flow is slow, and when it is
lower, the flow is faster and so is the bus ride.

The current travel time between the starting point of the
line and its endpoint was initially acquired using
Requests, a Python HTTP library that
aims to make HTTP requests simpler and more human
friendly, according to its documentation [23]. One use of the
library is to retrieve the HTML code of a
chosen web page. Within that code the travel
time appears as text, and the piece of text containing
the number of minutes between the points is extracted.
This value can then be used to describe the current
traffic flow in the region.
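
The extraction step can be sketched as follows. The HTML is assumed
to have been downloaded already (for instance with
`requests.get(url).text`), and the wording of the duration on the
page ("29 min", "1 h 5 min") is an assumption; a real page may need a
different pattern. The three-way comparison with the average is a
small extension of the slow/fast rule described earlier.

```python
import re

# Matches durations such as "29 min" or "1 h 5 min" (assumed wording).
DURATION = re.compile(r"(?:(\d+)\s*h\s*)?(\d+)\s*min")


def extract_minutes(html):
    """Return the first travel time found in the page, in minutes."""
    match = DURATION.search(html)
    if match is None:
        return None
    hours = int(match.group(1) or 0)
    return hours * 60 + int(match.group(2))


def traffic_flow(current_minutes, average_minutes):
    """Longer than average means slow traffic, shorter means fast."""
    if current_minutes > average_minutes:
        return "slow"
    if current_minutes < average_minutes:
        return "fast"
    return "average"


page = "<div><span>29 min</span></div>"  # hand-written fragment
minutes = extract_minutes(page)
print(minutes, traffic_flow(minutes, 35))
```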

3.1.4 Additional Data Collection 

A piece of data of great importance to the network is the
day of the week on which the collection was performed. The
day of the week matters because on the Friday before a
holiday, for example, the flow of traffic is very different
from that of an ordinary Tuesday. The day of the week
can be obtained in Python using
the Calendar library. (Python was chosen because it is one of
the languages supported by Keras, which
is used for the neural network, and because it supports
all the services used, so that only one language is needed.)
To get the day of the week, the date is
passed to a function called "weekday",
which returns the day of the week for the given date.
Information about special dates and holidays was obtained
from a web service called "Rest-API with Holidays from
all cities of Brazil" [24]. Given the IBGE
(Brazilian Institute of Geography and Statistics) code of
the chosen city, it returns the national, state and
municipal holidays of that city. The code of
Curitiba is 4106902. When collecting the data, the
current day is compared with the holidays to
check three situations, and the answer is encoded in
3 bits: the first bit for a holiday eve, the second for the
holiday itself, and the last for the day after a holiday,
with 1 for true and 0 for false.
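
The two steps above can be sketched with the standard library alone.
`calendar.weekday` takes a year, month and day and returns 0 for
Monday through 6 for Sunday. The holiday list would come from the web
service [24]; here a single hand-written date (September 7, Brazil's
Independence Day, with 2018 as an illustrative year) stands in for
it.

```python
import calendar
from datetime import date, timedelta


def weekday_index(day):
    """calendar.weekday returns 0 for Monday through 6 for Sunday."""
    return calendar.weekday(day.year, day.month, day.day)


def holiday_bits(day, holidays):
    """Return (eve, holiday, day_after) as 0/1 flags."""
    one_day = timedelta(days=1)
    return (int((day + one_day) in holidays),
            int(day in holidays),
            int((day - one_day) in holidays))


# Stand-in for the answer of the holiday web service.
holidays = {date(2018, 9, 7)}

print(weekday_index(date(2018, 9, 7)))
print(holiday_bits(date(2018, 9, 6), holidays))
```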

Other data could also have an influence, such as the
occurrence of large events in the region and other aspects of
everyday social life. Event information could be collected
from digital newspapers in the region or from news websites,
also through Web Scraping or RSS, but it has not been
used so far.

3.2 Pre-processing of data 

Some data, such as the day of the week, the weather
description and the bus status, are in text format and
must be converted to numbers, since the neural network
model takes numeric input. In the first case, the following
transformation was made: Sunday became 1, Monday 2,
Tuesday 3 and so on. For the weather
description, all
possible answers were listed and each one was assigned a
number; for example, "Cloudy weather" became
1 and "Sunny" became 4. For the bus status
the same technique was used, but with 3
bits, each being 0 or 1 depending on the situation:
delayed became 100, early 010 and on schedule
001.
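
These first conversions can be sketched as lookup tables plus one
small function. The description table holds only the two codes given
in the text, so it is an incomplete illustration, and the year used
in the example call (2018) is an assumption, since the paper gives
only the day and month of its sample.

```python
import calendar

# Only the two codes mentioned in the text; the full table is not given.
DESCRIPTION_CODES = {"Cloudy weather": 1, "Sunny": 4}

# Bus status as three bits: late, early, on time.
SITUATION_BITS = {"late": (1, 0, 0), "early": (0, 1, 0), "on time": (0, 0, 1)}


def weekday_number(year, month, day):
    """Sunday -> 1, Monday -> 2, ..., Saturday -> 7."""
    monday_based = calendar.weekday(year, month, day)  # Monday is 0
    return (monday_based + 1) % 7 + 1


print(weekday_number(2018, 8, 24), DESCRIPTION_CODES["Sunny"],
      SITUATION_BITS["late"])
```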

An example of a collected sample is shown in Table 1. It
contains the following data: the day of the week, which in
the example is 6, equivalent to a Friday; the
day and month of collection, in this case
August 24; the hour and minute of collection, here
12:11; the temperature in the city at the time of collection,
29 degrees Celsius in the example; the description of
the current weather, here 4, which means "sunny"; the air
humidity, 40% in this sample; the condition slug, 1,
which means "clear day"; the code of the weather
condition (generated by the web service
itself); the number of incidents collected in the region;
the three holiday bits; and, lastly, the current travel time
between the start and end of the line, which at that moment
was 29 minutes.


Data              Value
Weekday           6
Day               24
Month             8
Hour              12
Minute            11
Temp              29
Description       4
Humidity          40
Condition Slug    1
Condition Code    32
Incidents         0
Holiday eve       0
Holiday           0
After Holiday     0
Time              29

Table 1: Input data collected on August 24 at 12:11.

There are three outputs: the first is a bit
representing late or not, the second early or not, and
the third on time or not. No two of these bits can
have the value 1 at once, since the bus cannot be late and
early at the same time, for example. Table 2
shows an example of output data where the condition is 100,
that is, late at the time of collection.


Output Data       Value
Late              1
Early             0
On Time           0

Table 2: Output data collected on August 24 at 12:11.

The first tests showed that it
would be better to convert the qualitative data to
binary form as well, since the neural network treats
numbers as magnitudes when applying its weights. The
quantitative data were kept in decimal form. Left as single
numbers, the qualitative fields might suggest to the neural
network that Monday is less than Saturday, for example, or
that description 4 is larger than description 1, which is
not true. The idea that should be passed to the network is
different: something like "it's Monday", yes or no. The
following change was therefore made: the day of the week,
description, short description and condition
code fields were converted to a binary form with a yes or no
for each possible case, 1 or 0, respectively. For the day of
the week, for example, the number of the day became 7
values, each one corresponding to one day of the week:
Monday, for example, became 1000000, and Tuesday
0100000. This formatting was used for all the cases cited.
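
This yes-or-no representation is commonly called one-hot encoding. A
minimal sketch, assuming the days after Tuesday follow in calendar
order (the text gives only the Monday and Tuesday patterns):

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def one_hot(value, categories):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


print(one_hot("Monday", DAYS))
print(one_hot("Tuesday", DAYS))
```

The same function applies to the description and condition code
fields once their lists of possible values are known.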

Data were initially acquired for 3 months, to verify
the operation before continuing the collection,
which yielded 3000 samples in that period. These
data were then used to create the neural network.

IV. EXPERIMENTS EXECUTION

Achieving the objective of predicting delays
requires predicting events. To predict is to make
statements about something that will happen, usually
based on information from the past and the current state.
Neural networks can be used for prediction and offer
advantages such as learning dependencies automatically
from measured data alone, without any need to add
more information. Moreover, the network can be trained
from historical data and does not have to be represented by
an explicitly given model.

4.1 Neural Network 

According to [25], Neural Networks, or Artificial Neural
Networks, find applications in very diverse fields by
virtue of their ability to learn from input data, with or
without a teacher, and because they represent a technology
rooted in various disciplines (such as neuroscience,
mathematics, statistics, physics, computer science and
engineering). Some examples of these fields are modeling,
time series analysis, pattern recognition, signal processing
and control. As stated in [26], artificial neural networks
can be considered a methodology for solving problems
characteristic of artificial intelligence.

Neural networks are massively parallel systems
composed of simple processing units that compute certain
mathematical functions [27]. From a set of examples
presented to them, the networks are able to generalize the
assimilated knowledge to a set of unknown data. They
can also extract non-explicit characteristics
from a set of information provided to them as examples
[28].

4.2 Experiment setup 

Keras is described in its documentation [29] as an open
source neural network library written in Python. It is able
to work with tools like Google TensorFlow [30], an
open source library for numerical
computation and machine learning [31].
Designed to enable rapid experimentation with deep neural
networks, Keras focuses on being easy to use, modular and
extensible. It was used to build the
neural network of this work.

To make use of these tools, a program was written in Python
with data input and output in a comma-separated
values (CSV) file, a format that stores
tables with values separated by commas. The number of
training epochs was defined, a counter of correct and
incorrect predictions was created, and an interface shows the
response of the system to an input (late, early or on time).
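
Reading such a file can be sketched with the standard `csv` module.
The column order (the 15 input fields of Table 1 followed by the
three output bits of Table 2) is an assumption made for this example;
the paper does not describe the layout of its file.

```python
import csv
import io


def load_samples(handle, n_outputs=3):
    """Split each CSV row into input values and the trailing output bits."""
    inputs, outputs = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = [float(cell) for cell in row]
        inputs.append(values[:-n_outputs])
        outputs.append(values[-n_outputs:])
    return inputs, outputs


# One row built from Tables 1 and 2; an in-memory file replaces the disk file.
sample_csv = io.StringIO("6,24,8,12,11,29,4,40,1,32,0,0,0,0,29,1,0,0\n")
x, y = load_samples(sample_csv)
print(len(x[0]), y[0])
```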

The best result was obtained without changing the
optimizer and with 10 training epochs. The model uses
only two layers, the first receiving the input
data and the second producing the output, with 15 and 3
neurons respectively.

The final configuration used two layers: the first with
"normal" as the kernel initializer and
"relu" as the activation, and the second with the same
kernel initializer and "softmax" as the activation.
The model was compiled with
"categorical_crossentropy" as the loss and the
"sgd" optimizer. The training settings were: 10
epochs, a batch_size of 100 and verbose set to 0.
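
The configuration above might be expressed in Keras roughly as
follows. This is a sketch under stated assumptions, not the code used
in this work: it relies on the Keras bundled with TensorFlow 2, uses
"random_normal" in place of the older "normal" alias, leaves the
input width as a parameter because it depends on how the fields are
encoded, and assumes the 90/10 split was made with
`validation_split`.

```python
# Settings as reported in the text.
CONFIG = {
    "hidden_units": 15,
    "output_units": 3,
    "loss": "categorical_crossentropy",
    "optimizer": "sgd",
    "epochs": 10,
    "batch_size": 100,
    "verbose": 0,
}


def build_model(n_inputs):
    """Two dense layers: ReLU followed by a 3-way softmax."""
    # Imported here so the settings above can be read without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        # "random_normal" stands in for the "normal" alias named in the text.
        keras.layers.Dense(CONFIG["hidden_units"],
                           kernel_initializer="random_normal",
                           activation="relu"),
        keras.layers.Dense(CONFIG["output_units"],
                           kernel_initializer="random_normal",
                           activation="softmax"),
    ])
    model.compile(loss=CONFIG["loss"], optimizer=CONFIG["optimizer"],
                  metrics=["accuracy"])
    return model


def train(model, x, y):
    """Hold out 10% of the rows for validation (split method is assumed)."""
    return model.fit(x, y, epochs=CONFIG["epochs"],
                     batch_size=CONFIG["batch_size"],
                     verbose=CONFIG["verbose"], validation_split=0.1)
```

With NumPy arrays `x` and `y`, a call such as
`train(build_model(x.shape[1]), x, y)` returns a history object whose
`history` dictionary holds the per-epoch loss and accuracy.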

4.3 Results obtained 

After training, the network reached
84.59% accuracy on the validation data and 92% on the
training data so far. The network used 90% of the
collected data for training and the rest for validation.
In other words, when the network's outputs were compared
with the collected results, the class with the highest
percentage was the correct answer in 84.59% of the
validation cases.

The outputs are presented as the chance of each situation
occurring. For example, a forecast made for June 20 at 12:00
with rain gave the following results: a 23.74%
chance that the bus is late, a 9.81% chance that it is early
and a 66.43% chance that it is on time. The
final response from the network is therefore that the bus
will probably be on time on June 20 at noon. This result
can be verified on the day and time in question, provided the
weather conditions are predicted correctly.
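
Turning the three percentages into a final answer amounts to picking
the largest one, and accuracy is the share of cases where that pick
matches the collected status. A small sketch using the forecast
quoted above (the function names are illustrative):

```python
LABELS = ("late", "early", "on time")


def most_likely(probabilities):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities[best]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label matches the actual one."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


forecast = [0.2374, 0.0981, 0.6643]  # late, early, on time
label, confidence = most_likely(forecast)
print(f"{label} ({confidence:.2%})")
```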

In view of the results presented here, the network gives
a reasonable response, with some field tests showing
positive results, and it may give an even better
response once a larger
data collection is available.

V. CONCLUSION 

Population satisfaction with public services is of
great importance for improving the quality of life,
facilitating day-to-day living, and raising the level of
satisfaction with the government. Public
transportation has a serious problem with delays and
requires methods that achieve good accuracy in their
predictions. Considering this need, this work proposed an
approach for the city of Curitiba focused on the
collection of information that may be directly related to
delays. The proposed approach is based on collecting
data at various moments and from various sources in a way
that makes the use of neural networks for prediction possible.







International Journal of Advanced Engineering Research and Science (IJAERS) [Vol-6, Issue-6, June-2019]

https://dx.doi.org/10.22161/ijaers.6.6.49 ISSN: 2349-6495(P) / 2456-1908(O)




The results achieved so far have been satisfactory, and
more in-depth research can build on them. The
predictions can then be distributed to users as a way to
improve the level of satisfaction with public transportation.

This paper presented a methodology for
collecting data linked to possible changes in the flow of
buses in the city of Curitiba and showed how these data can
be used to predict those variations in advance using neural
networks and to notify users and other interested parties.
The next objective is to develop a platform in which users
can identify the bus they will use and
the time they want to arrive, so that the platform can
notify them of the ideal time to board the bus or show
them a table with the departure and
arrival schedules of the chosen bus.